Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
IMO: the takeaway is that a strong LLM such as GPT-4 can act as an LLM-as-a-judge whose agreement with human evaluation reaches the same level as agreement between humans.
From the Abstract: on evaluating LLM-based chat assistants
we explore using strong LLMs as judges to evaluate these models on more open-ended questions
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: #MT-bench , a multi-turn question set; and #Chatbot_Arena , a crowdsourced battle platform. (Abstract)
The agreement is evaluated in Section 4.
Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans.
we study the LLM-as-a-judge approach by comparing it to the gold standard of human evaluation (1 Introduction)
Table 1
We create MT-bench, a benchmark consisting of 80 high-quality multi-turn questions.
Chatbot Arena, a crowdsourcing benchmark platform featuring anonymous battles.
users can interact with two anonymous models simultaneously, posing the same question to both.
Users then vote for which response is better.
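To keep the Arena data format straight, here is a minimal sketch of what one battle record could look like; the field names are my own assumption, not the paper's released schema.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical shape of one Chatbot Arena battle record
# (field names are an assumption, not the actual schema).
@dataclass
class Battle:
    question: str                    # same prompt sent to both anonymous models
    model_a: str                     # identities hidden until after the vote
    model_b: str
    answer_a: str
    answer_b: str
    vote: Literal["A", "B", "tie"]   # the user's preference
```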
Section 3 is still on my to-read pile (looks interesting).
4 Agreement Evaluation
We randomly sample 3K single-turn votes from 30K arena data (4.1)
Figure 3
Bias toward rating models from the same family more highly (e.g., the Claude plot in (b)).
We propose 3 LLM-as-a-judge variations (3.1)
Prompt templates are in Appendix A.
Pairwise comparison(Figure 5)
Single answer grading(Figure 6)
Reference-guided grading(Figure 8)
(The figures run up to Figure 10.)
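A minimal sketch of the pairwise-comparison variation, just to make the flow concrete: the prompt below only paraphrases the idea of the Appendix A template (it is not the actual template), and `call_llm` is a placeholder for whatever chat-completion client is used.

```python
# Minimal sketch of pairwise-comparison judging; the prompt is a paraphrase,
# not the exact Appendix A template, and call_llm() is a placeholder client.
JUDGE_PROMPT = """[System]
You are an impartial judge. Compare the two assistant answers to the user
question below and decide which is better. After a short explanation, output
"[[A]]", "[[B]]", or "[[C]]" for a tie.

[Question]
{question}

[Assistant A's answer]
{answer_a}

[Assistant B's answer]
{answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Return 'A', 'B', or 'tie' according to the LLM judge's verdict."""
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    if "[[A]]" in reply:
        return "A"
    if "[[B]]" in reply:
        return "B"
    return "tie"
```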
We present a few methods to address position bias and the limited grading ability for math questions (3.4)
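One of the Section 3.4 mitigations is swapping the two answers' positions; my reading is that a verdict is only kept when it is consistent across both orders, and is otherwise treated as a tie. A sketch reusing the hypothetical `judge_pair` above:

```python
def judge_pair_debiased(question, answer_a, answer_b, call_llm):
    """Judge twice with the answer order swapped; keep only a consistent verdict.

    A conservative reading of the position-bias fix in Sec. 3.4: if the two
    orderings disagree, the result is treated as a tie.
    """
    first = judge_pair(question, answer_a, answer_b, call_llm)
    second = judge_pair(question, answer_b, answer_a, call_llm)
    # In the swapped call, a vote for "A" actually refers to answer_b.
    second = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second else "tie"
```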
Biases are reported in Appendix B (Case Study).
Appendix D Additional Experimental Results
Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans.
According to Appendix (D.3?), tie votes are apparently excluded (per Sekine-san).
"If humans split 50% for A and 50% for B, and GPT-4 picks A, that counts as 50% agreement" (human-majority).
For example, if there are an equal number of “A” and “B” human votes for a question, and GPT-4 votes “A”, the agreement is counted as 1/2 on this question.
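A small sketch of how that per-question agreement could be computed (my reading of the example above, ignoring the tie-exclusion setting; not the paper's code):

```python
from collections import Counter

def question_agreement(gpt4_vote: str, human_votes: list[str]) -> float:
    """Agreement of the GPT-4 verdict with the pool of human votes on one question.

    Matches the example above: with human votes split 50% "A" / 50% "B" and a
    GPT-4 vote of "A", this returns 0.5.
    """
    counts = Counter(human_votes)
    return counts[gpt4_vote] / len(human_votes)

def overall_agreement(records) -> float:
    """Average per-question agreement over (gpt4_vote, human_votes) pairs."""
    scores = [question_agreement(g, h) for g, h in records]
    return sum(scores) / len(scores)

# Example: humans split evenly, GPT-4 says "A" -> 0.5 agreement on that question.
assert question_agreement("A", ["A", "B"]) == 0.5
```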